Abstract


We propose a class of Bayesian networks appropriate for structured prediction problems where the 
Bayesian network’s model structure is a function of the predicted output structure. These incremen- 
tal sigmoid belief networks (ISBNs) make decoding possible because inference with partial output 
structures does not require summing over the unboundedly many compatible model structures, due 
to their directed edges and incrementally specified model structure. ISBNs are specifically targeted 
at challenging structured prediction problems such as natural language parsing, where learning the 
domain’s complex statistical dependencies benefits from large numbers of latent variables. While 
exact inference in ISBNs with large numbers of latent variables is not tractable, we propose two ef- 
ficient approximations. First, we demonstrate that a previous neural network parsing model can be 
viewed as a coarse mean-field approximation to inference with ISBNs. We then derive a more ac- 
curate but still tractable variational approximation, which proves effective in artificial experiments. 
We compare the effectiveness of these models on a benchmark natural language parsing task, where 
they achieve accuracy competitive with the state-of-the-art. The model which is a closer approxi- 
mation to an ISBN has better parsing accuracy, suggesting that ISBNs are an appropriate abstract 
model of natural language grammar learning. 


Keywords: Bayesian networks, dynamic Bayesian networks, grammar learning, natural language 
parsing, neural networks 


1. Introduction 


In recent years, there has been increasing interest in structured prediction problems, that is,
classification problems with a large (or infinite) structured set of output categories. The set of output
categories is structured in the sense that useful generalisations exist across categories, as usually
reflected in a structured representation of the individual categories. For example, the output cate- 
gories might be represented as arbitrarily long sequences of labels, reflecting generalisations across 
categories which share similar sets of sub-sequences. Often, given an input, the structure of the 
possible output categories can be uniquely determined by the structure of the input. For example 
in sequence labelling tasks, all possible output categories are label sequences of the same length as 
the input sequence to be labelled. In this article, we investigate structured classification problems 
where this is not true; the structure of the possible output categories is not uniquely determined by
the input to be classified. The most common type of such problems is when the input is a sequence 
and the output is a more complex structure, such as a tree. In reference to this case, we will refer 
to problems where the output structure is not uniquely determined by the input as “parsing prob- 
lems”. Such problems frequently arise in natural language processing (e.g., prediction of a phrase 
structure tree given a sentence), biology (e.g., protein structure prediction), chemistry, or image 
processing. We will focus on the first of these examples, natural language parsing. The literature 
on such problems clearly indicates that good accuracy cannot be achieved without models which 
capture the generalisations which are only reflected in the output structure. For example, in English 
sentences, if a noun is parsed as the subject of a verb, then these words must be compatible in their 
singular/plural markings, independent of whether they are near each other in the input sentence. 

In addition to limiting the scope of this article to parsing problems, we focus on tasks where the 
training data specifies the output structure, but the labelling of this structure is not fully annotated. 
While the unannotated labelling may not be evaluated in the task, by assuming incomplete labelling 
we allow our models to capture generalisations which are not directly reflected in the labelled output
structure given for training. For example, the training data for natural language parsing problems 
is generally assumed to be a tree, but assuming that all generalisations can be expressed in terms 
of one-level fragments of the tree leads to poor empirical performance. However, much better 
performance can be achieved with such a model by extending the labelling to include features of the 
structural context (Charniak, 2000). Because we want to learn the necessary additional labelling, 
we need to solve a limited form of grammar induction. 

Graphical models provide the formal mechanisms needed to learn and reason about incomplete 
labelling, using latent variables. They also provide the formal mechanisms needed to specify the 
statistical dependencies implied by the structure of a single output category. However, these mech- 
anisms are not sufficient to specify a complete probability model for a parsing problem, because 
we need to specify the statistical dependencies for the complete space of possible output categories. 
As we will discuss in Section 3, even graphical models for unbounded sequence labelling, such as 
dynamic Bayesian networks, are in general not adequate for this task, because they are limited to 
finite-state models. 

There are well established methods for specifying probabilistic models of parsing problems, 
based on grammar formalisms, such as probabilistic context-free grammars (PCFGs). The grammar 
formalism defines how the complete space of possible pairs of an input sequence with an output 
structure can be specified as a set of sequences of decisions about the input-output pair. Each 
possible sequence of decisions, called a derivation, specifies a single input-output pair (e.g., phrase 
structure tree or protein structure). The probability model is then defined in terms of probabilities 
for each decision. In their most general form, these decision probabilities are conditioned on anything
from the unbounded history of previous decisions: 


$$P(T) = P(D^1, \ldots, D^m) = \prod_t P(D^t \mid D^1, \ldots, D^{t-1}), \tag{1}$$


where $T$ is the input-output structure and $D^1, \ldots, D^m$ is its equivalent sequence of decisions.
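
The factorisation in Equation (1) can be illustrated with a minimal sketch. The function names and the toy conditional model below are hypothetical and serve only to show how a derivation's probability accumulates one decision at a time, with each factor free to inspect the whole preceding history.

```python
import math

def derivation_log_prob(decisions, cond_prob):
    """Sum log P(D^t | D^1..D^{t-1}) over a derivation, following Equation (1)."""
    total = 0.0
    for t, decision in enumerate(decisions):
        history = tuple(decisions[:t])  # the full, unbounded history D^1..D^{t-1}
        total += math.log(cond_prob(decision, history))
    return total

def toy_cond_prob(decision, history):
    """Hypothetical model over the decisions "a" and "b" (an assumption, not from the article)."""
    if not history:
        return 0.5
    # Repeating the previous decision is more likely than switching.
    return 0.7 if decision == history[-1] else 0.3

log_p = derivation_log_prob(["a", "a", "b"], toy_cond_prob)
```

Here `toy_cond_prob` only looks at the last decision, but nothing in `derivation_log_prob` prevents a model from using any part of `history`, which is the generality that Equation (1) allows.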

In PCFGs, the context-free assumption means that only a bounded amount of the history 
$D^1, \ldots, D^{t-1}$ is relevant to the probability for decision $D^t$. The context-free assumption only
allows statistical dependencies within each bounded one-level subtree of the output tree, so two such
subtrees can only interact through the bounded choice of label for the node they share, if any. Be- 
cause the context-free assumption is defined in terms of the output structure, not in terms of the 
input sequence, which decisions in the history are relevant depends on the output structure specified
by the derivation. In graphical models, such a specification of which decisions are statistically de- 
pendent on which other decisions is called the “model structure”. Thus PCFGs, like other grammar 
formalisms, are examples of models where the model structure is a function of the output structure, 
not just of the input sequence. This is the fundamental distinction between models of parsing prob- 
lems and models of sequence labelling problems, and it will be central to our discussions in this 
article. 

The most common approach to building probability models for parsing problems is to use 
PCFGs without any latent variables (e.g., Charniak, 2000; Collins, 1999; Durbin et al., 1998), but 
this approach relies on hand-built sets of features to represent the unbounded decision histories 
in (1). Latent probabilistic context-free grammars (LPCFGs) (Matsuzaki et al., 2005) extend the 
node labels of PCFGs with latent annotations, but previous proposals have successfully induced 
only a small number of latent annotations. 

An alternative proposal for extending the labelling of parse trees is to use the hidden units of a
neural network (Henderson, 2003). In the model of Henderson (2003), vectors of hidden unit values
decorate the positions $t$ in the derivation sequence, and are used to encode features of the unbounded
derivation history $D^1, \ldots, D^{t-1}$. As with LPCFGs, the pattern of interdependencies between layers
of hidden units is a function of the output structure, making it appropriate for parsing problems. But 
unlike LPCFGs, the pattern of interdependencies is not required to respect the context-free assump- 
tion. This model achieved state-of-the-art results, but there is no clear probabilistic semantics for 
the induced hidden representations. 

In this article, we propose a class of graphical models which we call incremental sigmoid belief 
networks (ISBNs), which are closely related to the neural network of Henderson (2003), but which 
have a clear probabilistic semantics for all their variables. ISBNs are a kind of sigmoid belief 
network (Neal, 1992), but are dynamic models and have an incrementally specified set of statistical 
dependencies. Each position in the decision sequence has a vector of latent state variables, which 
are statistically dependent on variables from previous positions via a pattern of edges determined 
by the previous decisions. This incrementally specified model structure allows ISBNs to capture 
the generalisations which are only reflected in the output structure, such as the tendency towards 
correlations which are local in the output structure, which motivates the context-free assumption of 
PCFGs. 

Allowing the model structure to depend on the output structure means that the complete model 
structure is not known until the complete output derivation is known. In general, this can complicate 
decoding (i.e., parsing) because computing probabilities for sub-derivations requires marginalising 
out the unknown portion of the model structure, which in the worst case could require summing over 
an unbounded number of possible model structures. The properties of ISBNs avoid this problem 
because the probability of a derivation prefix is always independent of the unknown portion of the 
model structure, as discussed in Section 3. 

Despite this simplification, exact inference (i.e., computing probabilities) is not in general 
tractable in ISBNs, because they allow large vectors of latent variables in a heavily intercon- 
nected directed model. We demonstrate the practical applicability of ISBN models by providing 
efficient approximate inference methods. We consider two forms of approximation for ISBNs, a 
feed-forward neural network approximation (NN) and a form of mean field approximation (Saul 
and Jordan, 1999). In Section 5, we first show that the neural network model in Henderson (2003) 
can be viewed as a coarse approximation to inference with ISBNs. We then propose an incremental 
mean field method (IMF), which provides an improved approximation but remains tractable. Both
these approximations give us valid probability models. 

In Section 7, we present two empirical evaluations. In the first experiment, we trained both of 
the approximation models on artificial data generated from random ISBNs. The NN model achieves 
a 60% average relative error reduction over a baseline model and the IMF model achieves a further 
27% average relative error reduction over the NN model. These results demonstrate that the distri- 
bution of output structures specified by an ISBN can be approximated, that these approximations 
can be learnt from data, and that the IMF approximation is indeed better than the NN approxi- 
mation. In the second experiment, we apply both of the approximation models to phrase structure 
parsing with data from the Wall Street Journal Penn Treebank (Marcus et al., 1993). The IMF model 
achieves a statistically significant error reduction of about 8% over the NN model. Results of the IMF
model are non-significantly worse (less than 1% relative error increase) than the results of one of 
the best known history-based models of parsing (Charniak, 2000). We argue that this correlation 
between better approximation and better accuracy suggests that ISBNs are a good abstract model 
for structured prediction. 

Section 8 discusses related work not covered in the rest of this article. It focuses particularly on 
previous work on LPCFGs. 


2. Inference with Sigmoid Belief Networks 


Before defining ISBNs, we provide background on sigmoid belief networks. A sigmoid belief net- 
work (SBN) (Neal, 1992) is a type of Bayesian network. Bayesian networks are directed acyclic 
graphs where the nodes are variables and the edges specify statistical dependencies between vari- 
ables. SBNs have binary variables which have conditional probability distributions (CPDs) of the 
form: 


$$P(S_i = 1 \mid Par(S_i)) = \sigma\Big(\sum_{S_j \in Par(S_i)} J_{ij} S_j\Big), \tag{2}$$


where $Par(S_i)$ is the set of variables with edges directed to $S_i$, $\sigma$ denotes the logistic sigmoid
function $\sigma(x) = 1/(1+e^{-x})$, and $J_{ij}$ is the weight for the edge from variable $S_j$ to variable $S_i$.¹ SBNs
are similar to feed-forward neural networks, but unlike neural networks, SBNs have a precise prob- 
abilistic semantics of their hidden variables. In ISBNs we consider a generalised version of SBNs 
where we allow variables with any range of discrete values. The normalised exponential function is 
used to define the CPDs at these nodes: 


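$$P(S_i = k \mid Par(S_i)) = \frac{\exp\big(\sum_{S_j \in Par(S_i)} W^i_{kj} S_j\big)}{\sum_{k'} \exp\big(\sum_{S_j \in Par(S_i)} W^i_{k'j} S_j\big)}, \tag{3}$$

where $W^i$ is the weight matrix for the variable $S_i$.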
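
As a minimal sketch of the two kinds of CPD in Equations (2) and (3), the following computes the logistic sigmoid distribution of a binary variable and the normalised exponential distribution of a multi-valued variable. The function names and the example weights are illustrative assumptions; bias terms are omitted, as in the text.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def binary_cpd(weights, parent_values):
    """P(S_i = 1 | Par(S_i)) as in Equation (2); weights[j] plays the role of J_ij."""
    return sigmoid(sum(w * s for w, s in zip(weights, parent_values)))

def discrete_cpd(weight_matrix, parent_values):
    """P(S_i = k | Par(S_i)) for every value k, as in Equation (3).

    weight_matrix[k][j] plays the role of W^i_kj.
    """
    scores = [sum(w * s for w, s in zip(row, parent_values)) for row in weight_matrix]
    top = max(scores)  # subtracting the maximum leaves the result unchanged but avoids overflow
    exps = [math.exp(a - top) for a in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

# Hypothetical weights and parent values, chosen only for illustration.
p_on = binary_cpd([0.0, 2.0, -2.0], [1, 1, 1])
dist = discrete_cpd([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [1, 0])
```

With two values and one weight row fixed at zero, `discrete_cpd` reduces to the sigmoid case, which is why the normalised exponential form is a natural generalisation of Equation (2).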

Exact inference with all but very small SBNs is not tractable. Initially sampling methods were
used (Neal, 1992), but they are also not feasible for large networks, especially for the dynamic
models of the type described in Section 4. Variational methods have also been proposed for
approximating SBNs (Saul et al., 1996; Saul and Jordan, 1999). The main idea of variational methods
(Jordan et al., 1999) is, roughly, to construct a tractable approximate model with a number of free
parameters. The free parameters are set so that the resulting approximate model is as close as possible
to the original model for a given inference problem.

1. For convenience, where possible, we will not explicitly include bias terms in expressions, assuming that every latent
variable in the model has an auxiliary parent variable set to 1.

The simplest example of a variational method is the mean field method, originally introduced in 
statistical mechanics and later applied to neural networks in Hinton et al. (1995). Let us denote the 
set of visible variables in the model by $V$ and latent (hidden) variables by $H = h_1, \ldots, h_l$. The mean
field method uses a fully factorised distribution $Q(H|V) = \prod_i Q_i(h_i|V)$ as the approximate model,
where each $Q_i$ is the distribution of an individual latent variable. The independence between the
variables $h_i$ in this approximate distribution $Q$ does not imply independence of the free parameters
which define the $Q_i$. These parameters are set to minimise the Kullback-Leibler divergence between
the approximate distribution $Q(H|V)$ and the true distribution $P(H|V)$ or, equivalently, to maximise:


$$L_V = \sum_H Q(H|V) \ln \frac{P(H,V)}{Q(H|V)}. \tag{4}$$

The expression $L_V$ is a lower bound on the log-likelihood $\ln P(V)$. It is used in the mean field
theory (Saul and Jordan, 1999) as an approximation of the log-likelihood. However, in our case of
dynamic graphical models, as explained later, we have to use a different approach which allows us
to construct an incremental structured prediction method without needing to introduce the additional
parameters proposed in Saul and Jordan (1999), as we will discuss in Section 5.3.
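
The equivalence between minimising the Kullback-Leibler divergence and maximising the bound in Equation (4) follows from a standard identity, which the text does not spell out. The short derivation below is a sketch of that step, using only the symbols already defined; $KL(\cdot \,\|\, \cdot)$ is shorthand introduced here for the Kullback-Leibler divergence.

```latex
\begin{align*}
\ln P(V) &= \sum_H Q(H|V) \ln P(V)
          = \sum_H Q(H|V) \ln \frac{P(H,V)}{P(H|V)} \\
         &= \sum_H Q(H|V) \ln \frac{P(H,V)}{Q(H|V)}
          + \sum_H Q(H|V) \ln \frac{Q(H|V)}{P(H|V)} \\
         &= L_V + KL\big(Q(H|V) \,\|\, P(H|V)\big).
\end{align*}
```

Because $\ln P(V)$ does not depend on $Q$ and the divergence is non-negative, increasing $L_V$ decreases the divergence by the same amount, and $L_V \le \ln P(V)$ with equality exactly when $Q(H|V) = P(H|V)$.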


3. Incrementally Specifying Model Structure 


We want to extend SBNs to make them appropriate for modelling parsing problems. As discussed 
in the introduction, this requires being able to model arbitrarily long decision sequences $D^1, \ldots, D^m$,
and being able to specify the pattern of edges (the model structure) as a function of the chosen 
output structure. In this section, we define how incremental sigmoid belief networks specify such 
model structures. 

To extend SBNs for processing arbitrarily long sequences, such as the derivation decision
sequence $D^1, \ldots, D^m$, we use dynamic models. This gives us a form of dynamic Bayesian network
(DBN). To handle unboundedly long sequences, DBNs specify a Bayesian network template which 
gets instantiated for each position in the sequence, thereby constructing a Bayesian network which 
is as large as the sequence is long. This constructed Bayesian network is illustrated in the rightmost 
graph of Figure 1, where the repeated two-box pattern is the template, and the left-to-right order 
is the derivation order. This template instantiation defines a new set of variables for each position 
in the sequence, but the set of edges and parameters for these variables are the same as in other 
positions. The edges which connect variables instantiated for different positions must be directed 
forward in the sequence, thereby allowing a temporal interpretation of the sequence. DBNs based on 
sigmoid belief networks were considered in Sallans (2002) in the context of reinforcement learning. 
Normally, DBNs only allow edges between adjacent (or a bounded window of) positions, which 
imposes a Markov assumption on statistical dependencies in the Bayesian network. 

The problem with only allowing edges between variables instantiated at positions which are
adjacent (or local) in the decision sequence is that this does not allow the model structure to adequately
reflect the correlations found in parsing problems. In particular, in many domains, correlations tend
to be local in the output structure, even when they are not local in the derivation sequence for that
structure. To capture these correlations in the statistical dependencies learnt by the model, we want
the edges of the model to reflect locality in the output structure. This requires specifying edges
based on the actual outputs in the decision sequence $D^1, \ldots, D^m$, not just based on adjacency in this
sequence.

Figure 1: Illustration of the predictive LR derivation of an output structure and its associated
incremental specification of an ISBN model structure (ordered top-to-bottom, left-to-right).
Dotted lines indicate the top of the parser’s stack at each derivation decision in the model
structure.

We constrain this edge specification so that a decision $D^{t'}$ can only affect the placement of
edges whose destination variable is at a position $t > t'$ after the decision $D^{t'}$. This gives us a form
of switching model (Murphy, 2002), where each decision switches the model structure used for the 
remaining decisions. We allow the incoming edges for a given position to be any discrete func- 
tion of the sequence of decisions which precede that position. For this reason, we call our model 
an “incremental” model, not just a dynamic model; the structure of the Bayesian network is deter- 
mined incrementally as the decision sequence proceeds. This incremental specification of the model 
structure is illustrated in Figure 1 (the directed graphs), along with the partial output structures in- 
crementally specified by the derivation (the trees). In Figure 1, dotted lines associate a position’s 
instantiated template with the node in the output structure which is on top of the parser’s stack when 
making that position’s decision. Note that the incoming edges for a position’s instantiated template 
reflect edges between the associated nodes in the partial output structure. 

Any discrete function can be used to map the preceding sequence of decisions to a set of incoming
edges for a given decision. In general, we can characterise this function in terms of an automaton
which reads derivations and deterministically outputs model structures. For every derivation prefix
$D^1, \ldots, D^{t-1}$, the automaton outputs a set of labelled positions in the derivation prefix. For each
labelled position $(t-c, r)$ in this set, label $r$ determines which variables instantiated at that position
are linked to which variables instantiated at the current position $t$, and with which parameters.² For
example, the ISBN illustrated in Figure 1 uses a push-down automaton to compute which output
structure nodes are currently important (e.g., the top and next-to-top nodes on the automaton’s stack)
and specifies conditional dependencies between the current decision and previous decisions where
these nodes were on the top of the stack. By using a push-down automaton, this model is able to
express non-Markovian (e.g., context-free) regularities in the derivation sequences.

2. In all our models to date, we have respected the additional constraint that there is at most one labelled position in the
set for each label r, so the size of the set is bounded. We do not impose this constraint here because the model is still
well defined without it, but we do not have empirical evidence about the effect of removing it.
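
The idea of an automaton that reads a derivation prefix and outputs labelled positions can be sketched as follows. This is a toy illustration under stated assumptions, not the automaton used in the article: the decision names `"push"` and `"pop"`, the relation labels `"top"` and `"next_to_top"`, and the function name are all hypothetical, and positions are numbered from 0.

```python
def labelled_positions(decisions):
    """For each position t, map relation labels to earlier positions.

    The links for position t are computed before decision t is applied, so they
    depend only on the prefix of decisions that precedes t.
    """
    stack = []        # identifiers of output-structure nodes; stack[-1] is the top
    last_on_top = {}  # node id -> most recent position at which it was on top
    next_node = 0
    structure = []
    for t, decision in enumerate(decisions):
        links = {}
        if stack and stack[-1] in last_on_top:
            links["top"] = last_on_top[stack[-1]]
        if len(stack) > 1 and stack[-2] in last_on_top:
            links["next_to_top"] = last_on_top[stack[-2]]
        structure.append(links)
        if stack:
            last_on_top[stack[-1]] = t  # the current top node is associated with position t
        if decision == "push":          # hypothetical decision: create a node and push it
            stack.append(next_node)
            next_node += 1
        elif decision == "pop":         # hypothetical decision: finish the top node
            stack.pop()
        else:
            raise ValueError(f"unknown decision: {decision!r}")
    return structure

derivation = ["push", "push", "pop", "push", "pop"]
structure = labelled_positions(derivation)
```

In this example position 3 is linked to position 1 rather than to its neighbour, because the same node is on top of the stack at both positions. A fixed window over the derivation sequence could not express this link in general, since the distance between the two positions grows with the size of the intervening subtree.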

Previous applications of switching models to DBNs (e.g., Murphy, 2002) have allowed statistical 
dependencies to be a function of the output, but only of the output from the immediately preceding 
position in the sequence, and therefore have only allowed switching between a bounded number of 
alternatives. Because the number of switched alternatives is bounded, the whole set of alternatives 
can be expressed as a single bounded model, whose CPDs incorporate the discrete switching. Thus, 
switching does not allow us to specify any models which could not be specified with a complicated 
DBN, so switching DBNs also impose some form of Markov assumption. In terms of the automata 
discussed above, this means that switching DBNs can be expressed using finite-state automata, so 
would only be appropriate for problems with a regular-language structure to their output categories. 
This limitation does not give us sufficient power to express the kinds of output-conditioned statis- 
tical dependencies we need for parsing problems in general. Therefore, it is crucial to distinguish 
between standard dynamic models and our incremental models. 

Incremental sigmoid belief networks allow the model structure to depend on the output structure
without overly complicating the inference of the desired conditional probabilities $P(D^t \mid D^1, \ldots, D^{t-1})$.
Computing this probability requires marginalising out the unknown model structure for the portion
of the Bayesian network which follows position $t$. In general, this could require explicitly summing
over multiple possible model structures, or in the worst case summing over the unbounded number
of possible model structures. ISBNs avoid summing over any of these possible model structures
because in ISBNs $P(D^t \mid D^1, \ldots, D^{t-1})$ is independent of all model structure which follows position $t$.
This can be proved by considering two properties of ISBNs. At position $t$ in the sequence, the only
edges whose placement is not uniquely determined by $D^1, \ldots, D^{t-1}$ have their destinations after
$t$. Also, none of the variables after $t$ are visible (i.e., have their values specified in $D^1, \ldots, D^{t-1}$).
Therefore none of the edges whose placement is not yet known can have any impact on the
inference of $P(D^t \mid D^1, \ldots, D^{t-1})$, as follows directly from well known properties of Bayesian networks.
This property implies that each individual Bayesian network depicted in Figure 1 can be used to 
compute the conditional probability of its next derivation decision, and it will give the same answer 
as if the same conditional probability were computed in the final Bayesian network at the end of the 
derivation, or indeed in any such valid continuation. 

The use of directed edges to avoid the need to sum over unknown model structures can also 
be seen in Hidden Markov Models (HMMs). Given a sequence prefix, we can use an HMM to 
infer the probability of the following element of the sequence. This distribution is not dependent 
on the total length of the sequence, which would be needed to draw the complete HMM model 
for the sequence. Note that this property does not hold for undirected graphical models, such as 
Conditional Random Fields (Lafferty et al., 2001). Rohanimanesh et al. (2009) investigate inference 
in undirected models with edges that are a function of the output structure, but the solutions are 
approximate and computationally expensive. 

The incremental specification of model structure can also be seen in LPCFGs. Given a top- 
down left-to-right derivation of a phrase structure tree, the dependencies between LPCFG derivation 
decisions have the same structure as the phrase structure tree, but with LPCFG rules (one-level 
subtrees) labelling each node of this derivation tree. The number of branches at a node in the 
derivation tree is determined by the rule which is chosen to label that node, thereby incrementally
specifying the complete derivation tree. If we expressed an LPCFG as a graphical model, the model 
structures would have the same general form as the derivation trees, so the model structure would 
also be incrementally specified. Also, the edges in this graphical model would need to be directed, 
because LPCFG rule probabilities are locally normalised. Therefore LPCFGs can also be thought of
as Bayesian networks with incrementally specified model structure. The differences between ISBNs
and LPCFGs will be discussed in the next section and Section 8.

As illustrated by the above examples, the argument for the incremental specification of model 
structure can be applied to any Bayesian network architecture, not just sigmoid belief networks. We 
focus on ISBNs because, as shown in Section 5, they are closely related to the empirically successful 
neural network models of Henderson (2003). This previous work has shown that the combination 
of logistic sigmoid hidden units and having a model structure which reflects locality in the output
structure results in a powerful form of feature induction. The edges from hidden units to hidden 
units allow information to propagate beyond the notion of immediate structural locality defined in 
the model, but the logistic sigmoid ensures a bias against propagating information through long 
chains of hidden units, thereby providing a soft but domain-appropriate bias to feature induction. 


4. The Probabilistic Model of Structured Prediction 


In this section we complete the definition of incremental sigmoid belief networks for grammar 
learning. We only consider joint probability models, since they are generally simpler and, unlike 
history-based conditional models, do not suffer from the label bias problem (Bottou, 1991). Also, 
in many complex prediction tasks, such as phrase structure parsing, many of the most accurate
models make use of a joint model, either in reranking or model combinations (e.g., Charniak and 
Johnson, 2005; Henderson, 2004). 

We use a history-based probability model, as in Equation (1), but instead of treating each $D^t$
as an atomic decision, it will be convenient below to further split it into a sequence of elementary
decisions $D^t = d^t_1, \ldots, d^t_n$:


$$P(D^t \mid D^1, \ldots, D^{t-1}) = \prod_k P(d^t_k \mid h(t,k)),$$


where $h(t,k)$ denotes the decision history $D^1, \ldots, D^{t-1}, d^t_1, \ldots, d^t_{k-1}$. For example, a decision to
create a new node in a labelled output structure can be divided into two elementary decisions: 
deciding to create a node and deciding which label to assign to it. 

An example of the kind of graphical model we propose is illustrated in Figure 2. It is organised
into vectors of variables: latent state variable vectors $S^{t'} = s^{t'}_1, \ldots, s^{t'}_n$, representing an intermediate
state at position $t'$, and decision variable vectors $D^{t'}$, representing a decision at position $t'$, where
$t' \le t$. Variables whose values are given at the current decision $(t,k)$ are shaded in Figure 2; latent
and current decision variables are left unshaded.

As illustrated by the edges in Figure 2, the probability of each state variable $s^t_i$ depends on all
the variables in a finite set of relevant previous state and decision vectors, but there are no direct
dependencies between the different variables in a single state vector. As discussed in Section 3,
this set of previous state and decision vectors is determined by an automaton which runs over the
derivation history $D^1, \ldots, D^{t-1}$ and outputs a set of labelled positions in the history which are
connected to the current position $t$. For each pair $(t-c, r)$ in this set, $r$ represents a relation between
position $t$ and the position $t-c$ in the history. We denote by $r(t-c, t)$ the predicate which returns
true if the position $t-c$ with the relation label $r$ is included in the set for $t$, and false otherwise.
In general this automaton is allowed to perform arbitrary computations, as specified by the model
designer. For example, it could select the most recent state where the same output structure node
was on the top of the automaton’s stack, and a decision variable representing that node’s label. Each
such selected relation $r$ has its own distinct weight matrix for the resulting edges in the graph, but
the same weight matrix is used at each position where the relation is relevant (see Section 7.2 for
examples of relation types we use in our experiments).

Figure 2: ISBN for estimating $P(d^t_k \mid h(t,k))$.

We can write the dependency of a latent variable $s^t_i$ on previous latent variable vectors and a
decision history as:


$$P(s^t_i = 1 \mid S^1, \ldots, S^{t-1}, h(t,1)) = \sigma\Big(\sum_{r,t': r(t',t)} \Big(\sum_j J^r_{ij} s^{t'}_j + \sum_k B^{rk}_{i d^{t'}_k}\Big)\Big), \tag{5}$$

where $J^r$ is the latent-to-latent weight matrix for relation $r$ and $B^{rk}$ is the decision-to-latent weight
matrix for relation $r$ and elementary decision $k$. If there is no previous step $t' < t$ which is in relation
$r$ to the time step $t$, that is, $r(t',t)$ is false for all $t'$, then the corresponding relation $r$ is skipped in
the summation. For each relation $r$, the weight $J^r_{ij}$ determines the influence of the $j$th variable in the
related previous latent vector $S^{t'}$ on the distribution of the $i$th variable of the considered latent vector
$S^t$. Similarly, $B^{rk}_{i d^{t'}_k}$ defines the influence of the past decision $d^{t'}_k$ on the distribution of the considered
latent vector variable $s^t_i$.
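
A minimal sketch of the computation in Equation (5) is given below, assuming that the related positions have already been selected by the automaton and that the values of the earlier latent vectors are given. The data layout (dictionaries keyed by relation label and by position) and all names and numbers are assumptions made for illustration.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def latent_on_probability(i, links, latent, decisions, J, B):
    """P(s^t_i = 1 | ...) following Equation (5).

    links:     {relation r: related earlier position t'}; absent relations are skipped
    latent:    {t': list of binary values s^{t'}_j}
    decisions: {t': list of elementary decision ids d^{t'}_k}
    J:         J[r][i][j], latent-to-latent weights for relation r
    B:         B[r][k][i][d], decision-to-latent weights for relation r and elementary decision k
    """
    total = 0.0
    for r, t_prev in links.items():
        total += sum(J[r][i][j] * s for j, s in enumerate(latent[t_prev]))
        total += sum(B[r][k][i][d] for k, d in enumerate(decisions[t_prev]))
    return sigmoid(total)

# Hypothetical parameters: one relation, one latent unit at the current position,
# two latent units and one binary elementary decision at the related position.
J = {"top": [[1.0, -1.0]]}
B = {"top": [[[0.0, 0.5]]]}
latent = {1: [1, 1]}
decisions = {1: [1]}

p_linked = latent_on_probability(0, {"top": 1}, latent, decisions, J, B)
p_unlinked = latent_on_probability(0, {}, latent, decisions, J, B)
```

Note that the weights are indexed by relation rather than by position, so the same `J["top"]` and `B["top"]` are reused wherever that relation holds, as described in the text.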

In the previous paragraph we defined the conditional distribution of the latent vector variables.
Now we describe the distribution of the decision vector $D^t = d^t_1, \ldots, d^t_n$. As indicated in Figure 2, the
probability of each elementary decision $d^t_k$ depends both on the current latent vector $S^t$ and on the
previously chosen elementary action $d^t_{k-1}$ from $D^t$. This probability distribution has the normalised
exponential form:


$$P(d^t_k = d \mid S^t, d^t_{k-1}) = \frac{\Phi_{h(t,k)}(d) \exp\big(\sum_j W_{dj} s^t_j\big)}{\sum_{d'} \Phi_{h(t,k)}(d') \exp\big(\sum_j W_{d'j} s^t_j\big)}, \tag{6}$$

where $\Phi_{h(t,k)}$ is the indicator function of the set of elementary decisions that can possibly follow the
last decision in the history $h(t,k)$, and the $W_{dj}$ are the weights of the edges from the state variables.
$\Phi$ is essentially switching the output space of the elementary inference problems $P(d^t_k = d \mid S^t, d^t_{k-1})$
on the basis of the previous decision $d^t_{k-1}$. For example, in a generative history-based model of
natural language parsing, if decision $d^t_1$ was to create a new node in the tree, then the next possible
set of decisions defined by $\Phi_{h(t,2)}$ will correspond to choosing a node label, whereas if decision $d^t_1$
was to generate a new word then $\Phi_{h(t,2)}$ will select decisions corresponding to choosing this word.
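
The switching role of the indicator function in Equation (6) can be sketched as a normalised exponential restricted to the currently allowed decisions. The decision names, the weights and the function name below are hypothetical, and the sketch assumes that the allowed set is non-empty and that the latent values are given.

```python
import math

def decision_distribution(state, weights, allowed):
    """P(d | S^t, previous elementary decision) following Equation (6).

    state:   list of latent values s^t_j
    weights: {decision d: list of weights W_dj}
    allowed: set of decisions for which the indicator function equals 1 (must be non-empty)
    """
    scores = {d: sum(w * s for w, s in zip(weights[d], state)) for d in allowed}
    top = max(scores.values())  # subtracting the maximum avoids overflow
    exps = {d: math.exp(a - top) for d, a in scores.items()}
    norm = sum(exps.values())
    # Decisions outside the allowed set receive probability zero.
    return {d: exps[d] / norm if d in exps else 0.0 for d in weights}

# Hypothetical decisions: after creating a node, only label decisions are allowed.
weights = {
    "label_NP": [1.0, 0.0],
    "label_VP": [0.0, 1.0],
    "word_the": [2.0, 2.0],
}
dist = decision_distribution([1, 0], weights, allowed={"label_NP", "label_VP"})
```

Even though `"word_the"` has the largest score, it receives no probability mass here, because normalisation is carried out only over the decisions selected by the indicator function.
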
Given this design for using ISBNs to model derivations, we can compare such ISBN models to 
LPCFG models. As we showed in the previous section, LPCFGs can also be thought of as Bayesian 
networks with incrementally specified model structure. One difference between LPCFGs and IS- 
BNs is that LPCFGs add latent annotations to the symbols of a grammar, while ISBNs add latent 
annotations to the states of an automaton. However, this distinction is blurred by the use of gram- 
mar transforms in LPCFG models, and the many equivalences between grammars and automata. 
But certainly, the automata of ISBNs are much less constrained than the context-free grammars of 
LPCFGs. Another distinction between LPCFGs and ISBNs is that LPCFGs use latent annotations 
to split symbols into multiple atomic symbols, while ISBNs add vectors of latent variables to the 
existing symbol variables. The structure of the similarities between vectors is much richer than the 
structure of similarities between split atomic symbols, which gives ISBNs a more structured latent 
variable space than LPCFGs. This makes learning easier for ISBNs, allowing the induction of more 
informative latent annotations. Both these distinctions will be discussed further in Section 8. 


5. Approximating Inference in ISBNs 


Exact inference with ISBNs is straightforward, but not tractable. It involves a summation over all 
possible variable values for all the latent variable vectors. The presence of fully connected latent 
variable vectors does not allow us to use efficient belief propagation methods. Even in the case of 
dynamic SBNs (i.e., Markovian models), the large size of each individual latent vector would not 
allow us to perform the marginalisation exactly. This makes it clear that we need to develop meth- 
ods for approximating the inference problems required for structured prediction. Standard Gibbs 
sampling (Geman and Geman, 1984) is also expensive because of the huge space of variables and 
the need to resample after making each new decision in the sequence. It might be possible to de- 
velop efficient approximations to Gibbs sampling or apply more complex versions of Markov Chain 
Monte-Carlo techniques, but sampling methods are generally not as fast as variational methods. In 
order to develop sufficiently fast approximations, we have investigated variational methods. 

This section is structured as follows. We start by describing the application of the standard 
mean field approximation to ISBNs and discuss its limitations. Then we propose an approach to 
overcome these limitations, and two approximation methods. First we show that the neural network 
computation used in Henderson (2003) can be viewed as a mean field approximation with the added 
constraint that computations be strictly incremental. Then we relax this constraint to build a more
accurate but still tractable mean field approximation.


5.1 Applicability of Mean Field Approximations 


In this section we derive the most straightforward way to apply mean field methods to ISBNs. Then
we explain why this approach is not feasible for structured prediction problems of the scale of
natural language parsing.
The standard use of the mean field theory for SBNs (Saul et al., 1996; Saul and Jordan, 1999)
is to approximate probabilities using the value of the lower bound $L_V$ from expression (4) in
Section 2. To obtain a tighter bound, as we explained above, $L_V$ is maximised by choosing the optimal
distribution $Q$. To approximate $P(d^t_k \mid h(t,k))$ using the value of $L_V$, we have to include the
current decision $d^t_k$ in the set of visible variables, along with the visible variables specified in
$h(t,k)$. Then to estimate the conditional probability $P(d^t_k \mid h(t,k))$, we need to normalise over
the set of all possible values of $d^t_k$. Thus we need to compute a separate estimate
$\max_Q L_V^{t,k}(d)$ for each possible value of $d^t_k = d$:

$$\max_Q L_V^{t,k}(d) = \max_Q \sum_{H^t} Q(H^t \mid h(t,k), d^t_k = d) \ln \frac{P(H^t, h(t,k), d^t_k = d)}{Q(H^t \mid h(t,k), d^t_k = d)},$$

where $H^t = \{S^1, \ldots, S^t\}$. Then $P(d^t_k = d \mid h(t,k))$ can be approximated as the normalised
exponential of the $L_V^{t,k}(d)$ values:

$$\hat{P}(d^t_k = d \mid h(t,k)) = \frac{\exp(\max_Q L_V^{t,k}(d))}{\sum_{d'} \exp(\max_Q L_V^{t,k}(d'))}. \qquad (7)$$


It is not feasible to find the optimal distribution $Q$ for SBNs, and mean field methods (Saul et al.,
1996; Saul and Jordan, 1999) use an additional approximation to estimate $\max_Q L_V^{t,k}(d)$. Even with
this approximation, the maximum can be found only by using an iterative search procedure. This
means that decoding estimator (7) requires performing this numerical procedure for every possible
value of the next decision. Unfortunately, in general this is not feasible, in particular with labelled
output structures where the number of possible alternative decisions $d^t_k$ can be large. For our
generative model of natural language parsing, decisions include word predictions, and there can be a very
large number of possible next words. Even if we choose not to recompute mean field parameters for
all the preceding states $S^{t'}$, $t' < t$, but only for the current state $S^t$ (as proposed below),
tractability still remains a problem.$^3$

In our modifications of the mean field method, we propose to consider the next decision $d^t_k$ as
a hidden variable. Then the assumption of full factorisability of $Q(H^t, d^t_k \mid h(t,k))$ is stronger
than in the standard mean field theory because the approximate distribution $Q$ is no longer conditioned
on the next decision $d^t_k$. The approximate fully factorisable distribution $Q(H|V)$ can be written as:

$$Q(H|V) = q^t_k(d^t_k) \prod_{t',i} (\mu^{t'}_i)^{s^{t'}_i} (1 - \mu^{t'}_i)^{1 - s^{t'}_i},$$

where $\mu^{t'}_i$ is the free parameter which determines the distribution of state variable $i$ at position
$t'$, namely its mean, and $q^t_k(d^t_k)$ is the free parameter which determines the distribution over
decisions $d^t_k$. Importantly, we use $q^t_k(d)$ to estimate the conditional probability of the next decision:

$$P(d^t_k = d \mid h(t,k)) \approx q^t_k(d),$$

and the total structure probability is therefore computed as the product of decision probabilities
corresponding to its derivation:

$$P(T) = P(D^1, \ldots, D^m) \approx \prod_{t,k} q^t_k(d^t_k). \qquad (8)$$

3. We conducted preliminary experiments with natural language parsing on very small data sets and even in this setup
the method appeared to be very slow and, surprisingly, not as accurate as the modification considered further in this
section.

Figure 3: A graphical model fragment where variable A is a sink.


5.2 A Feed-Forward Approximation 


In this section we will describe the sense in which neural network computation can be regarded as a 
mean field approximation under an additional constraint of strictly feed-forward computation. We 
will call this approximation the feed-forward approximation. As in the mean field approximation, 
each of the latent variables in the feed-forward approximation is independently distributed. But 
unlike the general case of mean field approximation, in the feed-forward approximation we only
allow the parameters of every distribution $Q(s^t_i \mid h(t,k))$ and $Q(d^t_k \mid h(t,k))$ to depend on the
approximate distributions of their parents, thus requiring that any information about the distribution of its
descendants is not taken into account. This additional constraint increases the potential for a large 
KL divergence with the true model, but it significantly simplifies the computations. 

We start with a simple proposition for general graphical models. Under the feed-forward as- 
sumption, computation of the mean field distribution of a node in an ISBN is equivalent to compu- 
tation of a distribution of a variable corresponding to a sink in the graph of the model, that is, a node 
which does not have any outgoing edges. For example, node A is a sink in Figure 3. The following 
proposition characterises the mean field distribution of a sink. 


Proposition 1 The optimal mean field distribution of a sink $A$ depends on the mean field distribution
$Q(B)$ of its hidden parents $B = (B_1, \ldots, B_m)$ as

$$Q(A = a) \propto \exp(E_Q \log P(A = a \mid B, C)),$$

where $Q$ is the mean field distribution of hidden variables, $P$ is the model distribution, $C$ are visible
parents of the node $A$ and $E_Q$ denotes the expectation under the mean field distribution $Q(B)$.


This proposition is straightforward to prove by maximising the variational bound $L_V$ (4) with
respect to the distribution $Q(A)$. Now we can use the fact that SBNs have log-linear CPD. By
substituting their CPD given in expression (2) for $P$ in the proposition statement, we obtain:

$$Q(S_i = 1) = \sigma\Big(\sum_{S_j \in Par(S_i)} J_{ij} \mu_j\Big),$$

which exactly replicates computation of a feed-forward neural network with the logistic sigmoid
activation function. Similarly, we can show that for variables with soft-max CPD, as defined in (3),
their mean field distribution will be the log-linear function of their parents' means. Therefore
minimising KL divergence under the constraint of feed-forward computation is equivalent to using
log-linear functions to compute distributions of random variables given means of their parents.
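
A brief sketch of why the proposition and the sigmoid form hold, under the assumptions that $Q$ factorises as $Q(A)Q(B)$ and that the sink $A$ appears in no conditional distribution other than its own. The Lagrange multiplier $\lambda$ and the shorthand $\eta_i$ are introduced here only for illustration:

```latex
\begin{align*}
L_V &= E_{Q(A)} E_{Q(B)} \ln P(A \mid B, C) - \sum_a Q(A=a) \ln Q(A=a) + \text{const}, \\
0 &= \frac{\partial}{\partial Q(A=a)} \Big[ L_V + \lambda \Big( \sum_{a'} Q(A=a') - 1 \Big) \Big]
   = E_{Q(B)} \ln P(A=a \mid B, C) - \ln Q(A=a) - 1 + \lambda, \\
Q(A=a) &\propto \exp\big( E_{Q(B)} \ln P(A=a \mid B, C) \big). \\[6pt]
\text{For } P(S_i = 1 \mid Par(S_i)) &= \sigma(\eta_i), \quad \eta_i = \sum_{S_j \in Par(S_i)} J_{ij} S_j : \\
\ln \frac{Q(S_i=1)}{Q(S_i=0)} &= E_Q \big[ \ln \sigma(\eta_i) - \ln (1 - \sigma(\eta_i)) \big]
   = E_Q [\eta_i] = \sum_{S_j \in Par(S_i)} J_{ij} \mu_j, \\
Q(S_i = 1) &= \sigma\Big( \sum_{S_j \in Par(S_i)} J_{ij} \mu_j \Big).
\end{align*}
```

The second part uses only the identity $\ln \sigma(x) - \ln(1 - \sigma(x)) = x$, which is why the expectation passes through the sigmoid without any further approximation for a sink variable.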

Now let us return to the derivation of the feed-forward approximation of ISBNs. As we just
derived, under the feed-forward assumption, means of the latent vector $S^{t'}$ are given by

$$\mu^{t'}_i = \sigma(\eta^{t'}_i),$$

where $\eta^{t'}_i$ is the weighted sum of the parent variables' means:

$$\eta^{t'}_i = \sum_{r,t'': r(t'',t')} \Big( \sum_j J^r_{ij} \mu^{t''}_j + \sum_k B^{rk}_{i d^{t''}_k} \Big), \qquad (9)$$

as follows from the definition of the corresponding CPD (5).
The same argument applies to decision variables; the approximate distribution of the next
decision $q^t_k(d)$ is given by

$$q^t_k(d) = \frac{\Phi_{h(t,k)}(d) \exp(\sum_j W_{dj} \mu^t_j)}{\sum_{d'} \Phi_{h(t,k)}(d') \exp(\sum_j W_{d'j} \mu^t_j)}. \qquad (10)$$


The resulting estimate of the probability of the entire structure is given by (8). 

This approximation method replicates exactly the computation of the feed-forward neural network
model of Henderson (2003), where the above means $\mu^{t'}_i$ are equivalent to the neural network
hidden unit activations. Thus, that neural network probability model can be regarded as a simple
approximation to the ISBN graphical model.
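
The feed-forward computation of the latent means can be sketched as follows. This is a minimal illustration, assuming one weight matrix per related position and one weight vector per related past decision; the function name `feed_forward_means` and the toy numbers are hypothetical and are not taken from Henderson (2003):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feed_forward_means(related_means, J, decision_weights):
    """Means of one latent vector computed from its related predecessors.

    related_means    : list of (n,) arrays, means of the related previous latent vectors
    J                : list of (n, n) weight matrices, aligned with related_means
    decision_weights : list of (n,) arrays, the weights selected by the related
                       previous decisions (assumed to be looked up beforehand)
    """
    eta = np.zeros(J[0].shape[0])
    for J_r, mu_prev in zip(J, related_means):
        eta += J_r @ mu_prev      # contribution of related latent means
    for b in decision_weights:
        eta += b                  # contribution of related past decisions
    return sigmoid(eta)           # mean = logistic sigmoid of the weighted sum

# Toy example with a single related position and a single past decision.
mu_prev = np.array([0.9, 0.1])
J = [np.array([[2.0, 0.0], [0.0, -2.0]])]
B = [np.array([0.0, 1.0])]
mu = feed_forward_means([mu_prev], J, B)
```

Each mean depends only on quantities computed at earlier positions, so a single left-to-right pass suffices, exactly as in a feed-forward network with sigmoid hidden units.
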

In addition to the drawbacks shared by any mean field approximation method, this feed-forward 
approximation cannot capture bottom-up reasoning. By bottom-up reasoning, we mean the effect of 
descendants in a graphical model on distributions of their ancestors. For mean field approximations
to ISBNs, it implies the need to update the latent vector means $\mu^{t'}_i$ after observing a decision
$d^t_k$, for $t' < t$. The use of edges directly from decision variables to subsequent latent vectors is designed to
mitigate this limitation, but such edges cannot in general accurately approximate bottom-up reason- 
ing. The next section discusses how bottom-up reasoning can be incorporated in the approximate 
model. 


5.3 Incremental Mean Field Approximation 


In this section we relax the feed-forward assumption to incorporate bottom-up reasoning into the
approximate model. Again as in the feed-forward approximation, we are interested in finding the
distribution $Q$ which maximises the quantity $L_V$ in expression (4). The decision distribution
$q^t_k(d^t_k)$ maximises $L_V$ when it has the same dependence on the latent vector means $\mu^t_j$ as in
the feed-forward approximation, namely expression (10). However, as we mentioned above, the feed-forward
computation does not allow us to compute the optimal values of state means $\mu^{t'}_i$.

Optimally, after each new decision $d^t_k$, we should recompute all the means $\mu^{t'}_i$ for all the latent
vectors $S^{t'}$, $t' < t$. However, this would make the method intractable for tasks with long decision
sequences. Instead, after making each decision $d^t_k$ and adding it to the set of visible variables
$V$, we recompute only the means of the current latent vector $S^t$. This approach also speeds up
computation because, unlike in the standard mean field theory, there is no need to introduce an
additional variational parameter for each hidden layer variable $s^t_i$.

The denominator of the normalised exponential function in (6) does not allow us to compute
$L_V$ exactly. Instead, we approximate the expectation of its logarithm by substituting $s^t_j$ with their
means:$^4$

$$E_Q \ln \sum_d \Phi_{h(t,k)}(d) \exp(\sum_j W_{dj} s^t_j) \approx \ln \sum_d \Phi_{h(t,k)}(d) \exp(\sum_j W_{dj} \mu^t_j),$$

where the expectation is taken over the latent vector $S^t$ distributed according to the approximate
distribution $Q$. Unfortunately, even with this assumption there is no analytic way to maximise the
approximation of $L_V$ with respect to the means $\mu^t_i$, so we need to use numerical methods. We can
rewrite the expression (4) as follows, substituting the true $P(H,V)$ defined by the graphical model
and the approximate distribution $Q(H|V)$, omitting parts independent of the means $\mu^t_i$:

$$L_V^{t,k} = \sum_i \Big( -\mu^t_i \ln \mu^t_i - (1 - \mu^t_i) \ln (1 - \mu^t_i) + \mu^t_i \eta^t_i \Big) + \sum_{k'<k} \Big( \sum_j W_{d^t_{k'} j} \mu^t_j - \ln \Big( \sum_d \Phi_{h(t,k')}(d) \exp(\sum_j W_{dj} \mu^t_j) \Big) \Big), \qquad (11)$$


Here, $\eta^t_i$ is computed from the previous relevant state means and decisions as in (9). This expression
is concave with respect to the parameters $\mu^t_i$, so the global maximum can be found. In the appendix,
where we derive the learning algorithm, we show that the Hessian of this expression can be viewed
as the negated sum of a positive diagonal matrix and some covariance matrices, thus implying the
concavity of expression (11). We use coordinatewise ascent, where each $\mu^t_i$ is selected by a line
search (Press et al., 1996) while keeping other $\mu^t_{i'}$ fixed.

Though we avoided re-computation of means of the previous states, estimation of the complex
decision probability $P(D^t \mid h(t,k))$ will be expensive if the decision $D^t$ is decomposed into a large
number of elementary decisions. As an example, consider a situation in dependency parsing, where
after deciding to create a link, the parser might need to decide on the type of the link and, then,
predict the part of speech type of the word and, finally, predict the word itself. The main reason
for this complexity is the presence of the summation over $k'$ in expression (11), which results in
expensive computations during the search for an optimal value of $\mu^t_i$. This computation can be
simplified by using the means of $S^t$ computed during the estimation of $P(d^t_{k-1} \mid h(t,k-1))$ as priors
for the computation of the same means during the estimation of $P(d^t_k \mid h(t,k))$. If we denote the
means computed at an elementary step $(t,k)$ as $\mu^{t,k}_i$, then for $k = 1$, maximisation of $L_V^{t,k}$ can be
performed analytically, by setting

$$\mu^{t,1}_i = \sigma(\eta^t_i). \qquad (12)$$

4. In initial research, we considered the introduction of additional variational parameters associated with every possible
value of the decision variable in a way similar to Saul and Jordan (1999), but this did not improve the prediction
accuracy of the model, and considerably increased the computational time.

For $k > 1$, expression (11) can be rewritten as:

$$L_V^{t,k} = \sum_i \Big( -\mu^{t,k}_i \ln \mu^{t,k}_i - (1 - \mu^{t,k}_i) \ln (1 - \mu^{t,k}_i) + \mu^{t,k}_i \big( \ln \mu^{t,k-1}_i - \ln (1 - \mu^{t,k-1}_i) \big) + W_{d^t_{k-1} i} \mu^{t,k}_i \Big) - \ln \Big( \sum_d \Phi_{h(t,k-1)}(d) \exp(\sum_j W_{dj} \mu^{t,k}_j) \Big). \qquad (13)$$
d j 


Note that maximisation of this expression is done also after computing the last decision $K$ for the
state $t$. The resulting means $\mu^{t,K+1}_i$ are then used in the computation of $\eta^{t'}_i$ for the relevant
future states $t'$, that is, such $t'$ that $r(t,t')$ holds for some $r$:

$$\eta^{t'}_i = \sum_{r,t: r(t,t')} \Big( \sum_j J^r_{ij} \mu^{t,K+1}_j + \sum_k B^{rk}_{i d^t_k} \Big).$$


Concavity of expression (13) follows from concavity of (11), as their functional forms are
different by only a linear term and the presence of summation over the elementary decisions. See
the appendix where we will show that the Hessian of $L_V^{t,k}$ is negative semidefinite, confirming this
statement.
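
The update of the current means after an elementary decision can be sketched as coordinatewise ascent on the per-step objective (entropy term, prior term from the previous means, score of the chosen decision, and log-normaliser). This is only a sketch under stated assumptions: a hand-written golden-section search stands in for the line search of Press et al. (1996), and the names `bound`, `golden_max` and `update_means`, the fixed number of sweeps and the toy weights are all hypothetical:

```python
import numpy as np

def bound(mu, prior, W, d_prev, allowed):
    """Per-step objective for means mu, all entries strictly inside (0, 1)."""
    entropy = -(mu * np.log(mu) + (1 - mu) * np.log(1 - mu)).sum()
    prior_term = (mu * (np.log(prior) - np.log(1 - prior))).sum()
    scores = W @ mu
    log_norm = np.log(np.exp(scores[allowed]).sum())
    return entropy + prior_term + scores[d_prev] - log_norm

def golden_max(f, lo=1e-6, hi=1 - 1e-6, iters=60):
    """Golden-section search for the maximiser of a unimodal f on [lo, hi]."""
    g = (np.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - g * (b - a), a + g * (b - a)
        if f(c) > f(d):
            b = d          # maximum lies in [a, d]
        else:
            a = c          # maximum lies in [c, b]
    return (a + b) / 2

def update_means(prior, W, d_prev, allowed, sweeps=20):
    """Coordinatewise ascent: optimise one mean at a time, others held fixed."""
    mu = prior.copy()
    for _ in range(sweeps):
        for i in range(len(mu)):
            def f(x, i=i):
                trial = mu.copy()
                trial[i] = x
                return bound(trial, prior, W, d_prev, allowed)
            mu[i] = golden_max(f)
    return mu

# Toy example: decision 0 was just made, so the unit that supports it should rise.
prior = np.array([0.5, 0.5])
W = np.array([[2.0, 0.0], [0.0, 2.0]])
allowed = np.array([True, True])
mu = update_means(prior, W, 0, allowed)
```

Because the objective is concave, each one-dimensional search is unimodal and the sweeps move monotonically towards the global maximum; a production implementation would use a convergence test rather than a fixed sweep count.
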


6. Learning and Decoding 


We train the models described in Sections 5.2 and 5.3 to maximise the fit of the approximate models 
to the data. We use gradient descent, and a maximum likelihood objective function. In order to 
compute the derivatives with respect to the model parameters, the error should be propagated back 
through the structure of the graphical model. For the feed-forward approximation, computation of 
the derivatives is straightforward, as in neural networks (Rumelhart et al., 1986). But for the mean
field approximation, this requires computation of the derivatives of the means $\mu^{t,k}_i$ with respect to the
other parameters in expression (13). The use of a numerical search in the mean field approximation
makes the analytical computation of these derivatives impossible, so a different method needs to 
be used to compute their values. The appendix considers the challenges arising when using maxi- 
mum likelihood estimation with the incremental mean field algorithm and introduces a modification 
of the error backpropagation algorithm for this model. For both approximations, their respective 
backpropagation algorithms have computational complexity linear in the length of a derivation. 
The standard mean field approach considered in Saul and Jordan (1999) maximised $L_V$ (4)
during learning, because $L_V$ was used as an approximation of the log-likelihood of the training
data. $L_V$ is actually the sum of the log-likelihood and the negated KL divergence between the
approximate distribution $Q(H|V)$ and the SBN distribution $P(H|V)$. Thus, maximising $L_V$ will at
the same time direct the SBN distribution toward configurations which have a lower approximation
error. It is important to distinguish this regularisation of the approximate distribution from the
Gaussian priors on the SBN parameters, which can be achieved by simple weight decay. We believe
that these two regularisations should be complementary. However, in our version of the mean field
method the approximate distributions of hidden decision variables $q^t_k$ are used to compute the data
likelihood (8) and, thus, maximising this target function will not automatically imply KL divergence
minimisation. Application of an additional regularisation term corresponding to minimisation of the
KL divergence might be beneficial for our approach, and it could be a subject of further research.

Figure 4: Dynamic SBN used in artificial experiments.


In our current experiments, we used standard weight decay, which regularises the SBN distribution 
with a Gaussian prior over weights. 


ISBNs define a probability model which does not make any a-priori assumptions of indepen- 
dence between any decision variables. As we discussed in Section 3, the use of relations based on 
the partial output structure makes it possible to take into account statistical interdependencies be- 
tween decisions closely related in the output structure, but separated by arbitrarily many positions 
in the input structure. In general, this property leads to the complexity of complete search being 
exponential in the number of derivation decisions. Fortunately, for many problems, such as natural 
language parsing, efficient heuristic search methods are possible. 


7. Experiments 


The goal of the evaluation is to demonstrate that ISBNs are an appropriate model for grammar learn- 
ing. Also, we would like to show that learning the mean field approximation derived in Section 5.3 
(IMF method) results in a sufficiently accurate model, and that this model is more accurate than the 
feed-forward neural network approximation (NN method) of Henderson (2003) considered in
Section 5.2. First, we start with an artificial experiment where the training and testing data are known to
have been generated by an SBN, and compare models based on each of the approximation methods.
Second, we apply the models to a real problem, parsing of natural language, where we compare our 
approximations with state-of-the-art models. 


7.1 Artificial Experiment 


In order to have an upper bound for our artificial experiments, we do not consider incremental 
models, but instead use a dynamic sigmoid belief network, a first order Markovian model, and 
consider a sequence labelling task. This simplification allowed us to use Gibbs sampling from 
a true model as an upper bound of accuracy. The following generative story corresponds to the
random dynamic SBNs:

    Draw initial state vector $S^1$ from a distribution of initial states $P(S^1)$.
    $t = 0$.
    Do
        $t = t + 1$,
        draw a label $Y^t$ from the distribution $P(Y^t \mid S^t)$ as in (6),
        draw an input element $X^t$ from the distribution $P(X^t \mid Y^t, S^t)$,
        draw the next latent state vector $S^{t+1}$ from $P(S^{t+1} \mid S^t, X^t, Y^t)$,
    while $Y^t \neq 0$ and $t < t_{max}$.

Figure 5: An example phrase structure tree.


A graphical representation of this dynamic model is shown in Figure 4. Different weight
matrices were used in the computation of $P(X^t \mid Y^t, S^t)$ for each value of the label $Y^t$. It is easy to see
that this model is a special case of the ISBN graphical model, namely Figure 2 with non-Markovian
dependencies removed. The state vector length was set to 5, the number of possible labels to 6, the
number of distinct input elements to 8, the maximal length of each sequence $t_{max}$ to 100. We
performed 10 experiments.$^5$ For each of the experiments, we trained both IMF and NN approximations
on a training set of 20,000 elements, and tested them on another 10,000 elements. Weight-decay
and learning rate were reduced through the course of the experiments whenever accuracy on the
development set went down. Beam search with a beam of 10 was used during testing. The IMF
method achieved an average error reduction of 27% with respect to the NN method, where accuracy
of the Gibbs sampler was used as an upper bound (average accuracies of 80.5%, 81.0%, and 82.3%
for the NN, IMF, and sampler, respectively).
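
The generative story and the settings above (state vector length 5, 6 labels, 8 input elements, maximal length 100) can be sketched as a sampler. The Gaussian draw of the weights, the uniform initial state distribution, and the additive parameterisation of the transition are assumptions made for illustration; the text does not specify how the random models were drawn. The function name `sample_sequence` is hypothetical:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_sequence(rng, n_state=5, n_labels=6, n_inputs=8, t_max=100):
    """Draw a random dynamic SBN and sample one (input, label) sequence from it."""
    Wy = rng.normal(size=(n_labels, n_state))            # label weights
    Wx = rng.normal(size=(n_labels, n_inputs, n_state))  # one input matrix per label
    J = rng.normal(size=(n_state, n_state))              # state-to-state weights
    Bx = rng.normal(size=(n_inputs, n_state))            # input-to-state weights
    By = rng.normal(size=(n_labels, n_state))            # label-to-state weights
    s = (rng.random(n_state) < 0.5).astype(float)        # assumed uniform initial state
    seq, t = [], 0
    while True:
        t += 1
        y = rng.choice(n_labels, p=softmax(Wy @ s))      # label given state
        x = rng.choice(n_inputs, p=softmax(Wx[y] @ s))   # input given label and state
        seq.append((int(x), int(y)))
        p_next = 1.0 / (1.0 + np.exp(-(J @ s + Bx[x] + By[y])))
        s = (rng.random(n_state) < p_next).astype(float) # next binary state vector
        if y == 0 or t >= t_max:                         # label 0 ends the sequence
            break
    return seq

seq = sample_sequence(np.random.default_rng(0))
```

Label 0 acts as the stop symbol, so every sampled sequence either ends with label 0 or reaches the maximal length.
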

The IMF approximation performed better than the NN approximation on 9 experiments out of 
10 (statistically significant in 8 cases).$^6$ These results suggest that the IMF method leads to a much
more accurate model than the NN method when the true distribution is defined by a dynamic SBN. 
In addition, the average relative error reduction of even the NN approximation over the unigram 
model exceeded 60% (the unigram model accuracy was 77.4% on average), which suggests that 
both approximations are sufficiently accurate and learnable. 


7.2 Natural Language Parsing 


We compare our two approximations on the natural language phrase structure parsing task. The 
output structure is defined as a labelled tree, which specifies the hierarchical decomposition of a 





5. We preselected these 10 models to avoid random dynamic SBNs with trivial distributions. We excluded SBNs for 
which unigram model accuracy was within 3% of the Gibbs sampler accuracy, and where accuracy of the Gibbs 
sampler did not exceed 70%. All these constants were selected before conducting the experiments. 

6. In all our experiments we used the permutation test (Diaconis and Efron, 1983) to measure significance and
considered a result significant if p-value is below 5%.

    Decisions                  Stack
    1. Shift_{Bill/N}          [root]
    2. Project_{NP}            [root, Bill/N]
    3. Project_{S}             [root, NP]
    4. Shift_{sells/V}         [root, S]
    5. Project_{VP}            [root, S, sells/V]
    6. Shift_{fresh/J}         [root, S, VP]
    7. Project_{NP}            [root, S, VP, fresh/J]
    8. Shift_{oranges/N}       [root, S, VP, NP]
    9.-12. Attach              [root, S, VP, NP, oranges/N], ..., [root, S]

Figure 6: Derivation for a constituent parse tree.


sentence into phrases. An example of such a tree is presented in Figure 5, where the tree specifies
that the adjective (J) fresh and the noun (N) oranges form a noun phrase (NP) “fresh oranges”, 
which, when combined with the verb (V) sells, forms the verb phrase (VP) “sells fresh oranges”. 
The hypothesis we wish to test here is that the more accurate approximation of ISBNs will result 
in a more accurate model of parsing. If this is true, then it suggests that ISBNs are a good abstract 
model for problems similar to natural language parsing, namely parsing problems which benefit 
from latent variable induction. 

We replicated the same definition of derivation and the same pattern of interconnection between 
states as described in Henderson (2003). For the sake of completeness we will provide a brief 
description of the structure of the model here, though more details can be found in Henderson 
(2003). 

The model uses a modification of the predictive LR order (Soisalon-Soininen and Ukkonen, 
1979), illustrated in Figure 6. In this ordering, a parser decides to introduce a node into the parse 
tree after the entire subtree rooted at the node’s first child has been fully constructed. Then the 
subtrees rooted at the remaining children of the node are constructed in their left-to-right order. The 
state of the parser is defined by the current stack of nodes, the queue of remaining input words and 
the partial structure specified so far. The parser starts with an artificial root element in the stack and 
terminates when it reaches a configuration with an empty queue and with the artificial root element 
on the top of the stack. The algorithm uses 3 main types of decisions: 


1. The decision Shift$_w$ shifts the word $w$ from the queue to the stack.


2. The decision Project$_Y$ replaces the current top of the stack $X$ with a new node $Y$, and specifies
that $Y$ is the parent of $X$ in the output structure.


3. The decision Attach removes the current top of the stack $X$ and specifies that element $Y$ under
the top of the stack is the parent of $X$.


Though these three types of decisions are sufficient to parse any constituent tree, Henderson 
(2003) extends the parsing strategy to include a specific treatment of a particular configuration in 
the parse tree, Chomsky adjunction, using a version of the Attach decision called Modify. 
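
The effect of the three basic decision types on the stack can be illustrated by replaying a derivation for the example sentence "Bill sells fresh oranges". This is a minimal sketch: the function `replay`, the tuple encoding of decisions, and the node identifiers `NP1` and `NP2` (used to tell the two noun phrases apart) are assumptions for illustration, and the Modify decision is not covered:

```python
def replay(decisions):
    """Apply Shift/Project/Attach decisions; return the final stack and parent links."""
    stack = ["root"]
    links = []                              # (child, parent) pairs
    for kind, arg in decisions:
        if kind == "shift":                 # move the next word onto the stack
            stack.append(arg)
        elif kind == "project":             # replace top X by a new node Y, parent of X
            child = stack.pop()
            stack.append(arg)
            links.append((child, arg))
        elif kind == "attach":              # pop top X; the element under it is its parent
            child = stack.pop()
            links.append((child, stack[-1]))
        else:
            raise ValueError(f"unknown decision type: {kind}")
    return stack, links

derivation = [
    ("shift", "Bill/N"), ("project", "NP1"), ("project", "S"),
    ("shift", "sells/V"), ("project", "VP"),
    ("shift", "fresh/J"), ("project", "NP2"),
    ("shift", "oranges/N"),
    ("attach", None), ("attach", None), ("attach", None), ("attach", None),
]
stack, links = replay(derivation)
```

After the twelve decisions only the artificial root remains on the stack, and `links` holds every child-parent pair of the tree, including the attachment of `S` to the root.
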

As was defined in expression (5), the probability of each state variable $s^t_i$ in the ISBN depends on
all the latent variables and previous relevant decisions in a subset of previous relevant positions
$t' : r(t',t)$. In this ISBN model for phrase structure parsing, we use the same pattern of interconnections
between variables as in the neural network of Henderson (2003), where there are different relations
$r(t',t)$ for selecting previous decision variables $D^{t'}$ and for selecting previous latent variables $S^{t'}$.
Namely, the following four types of relations for selecting the previous positions $t' : r(t',t)$ for
latent variables $S^{t'}$ are used:


1. Stack Context: the last previous position with the same element on top of the stack as at
current position $t$.


2. Sub-Top of Stack: the last previous position where the node under the current top of the stack 
was on top of the stack. 


3. Left Child of Top of Stack: the last previous position where the leftmost child of the current 
stack top was on top of the stack. 


4. Right Child of Top of Stack: the last previous position where the rightmost child of the current 
stack top was on top of the stack. 


These relations were motivated by linguistic considerations and many of them have also been found 
useful in other parsing models (Johnson, 1998; Roark and Johnson, 1999). Also, this set of relations 
ensures that the immediately preceding state is always included somewhere in the set of connected 
states. This requirement ensures that information, at least theoretically, can pass between any two 
states in the decision sequence, thereby avoiding any hard independence assumptions. Also note that 
each relation only selects at most one position (the most recent one of that kind). This ensures that 
the number of such connections to a latent vector remains bounded at four, so it should generalise 
well across larger, more complex constituency structures. 

For selecting the previous positions $t' : r(t',t)$ for decision variables $D^{t'}$, the following relations
are used:


1. Previous: the previous position t — 1. 


2. Top: the position at which the current top of the stack was shifted (if it is a terminal) or 
introduced (if non-terminal). 


3. Last Shift: the position at which the last terminal was shifted. 


4. Left Terminal of Top of Stack: the position when the leftmost terminal dominated by the 
current stack top was shifted. 


This set includes the previous decision (Previous), which is important if the model does not do back- 
ward reasoning, as in the feed-forward approximation. The remaining relations pick out important 
labels, part-of-speech tags, and words in the context. 

We used the Penn Treebank Wall Street Journal corpus to perform the empirical evaluation 
of the considered approaches. It is expensive to train the IMF approximation on the whole WSJ 
corpus, so instead we both trained and tested the model only on sentences of length at most 15, as 
in Taskar et al. (2004); Turian et al. (2006); Finkel et al. (2008). The standard split of the corpus 
into training (9,753 sentences, 104,187 words), validation (321 sentences, 3,381 words), and testing 
(603 sentences, 6,145 words) was performed. 

As in Henderson (2003) and Turian and Melamed (2006) we used a publicly available tag- 
ger (Ratnaparkhi, 1996) to provide the part-of-speech tag for each word in the sentence. For each 
tag, there is an unknown-word vocabulary item which is used for all those words which are not 
sufficiently frequent with that tag to be included individually in the vocabulary. We only included 
a specific tag-word pair in the vocabulary if it occurred at least 20 times in the training set, which
(with tag-unknown-word pairs) led to the very small vocabulary of 567 tag-word pairs.

                                            R       P       F1
    Bikel, 2004                             87.9    88.8    88.3
    Taskar et al., 2004                     89.1    89.1    89.1
    NN method                               89.1    89.2    89.1
    Turian et al., 2006                     89.3    89.6    89.4
    IMF method                              89.3    90.7    90.0
    Charniak, 2000                          90.0    90.2    90.1

    Finkel et al., 2008, 'feature-based'    91.1    90.2    90.6

Table 1: Percentage labelled constituent recall (R), precision (P), combination of both (F1) on the
testing set.
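
The F1 column can be checked from the recall and precision columns, assuming that the "combination of both" is the usual harmonic mean of the two. The helper `f1` below is written for illustration only:

```python
def f1(recall, precision):
    """Harmonic mean of recall and precision (both given in percent)."""
    return 2 * precision * recall / (precision + recall)

imf = f1(89.3, 90.7)        # IMF method row
finkel = f1(91.1, 90.2)     # Finkel et al. (2008) row
charniak = f1(90.0, 90.2)   # Charniak (2000) row
```

Rounded to one decimal place, these values agree with the corresponding entries of Table 1.
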


For decoding, we use best-first search with the search space pruned in two different ways. First,
only a fixed number of the most probable partial derivations are pursued after each word shift
operation. Secondly, the branching factor at each decision is limited. In the experiments presented in this
paper, we used the post-shift beam width of 10 and the branching factor of 5. Increasing the beam
size and the branching factor beyond these values did not significantly affect parsing accuracy. For
both of the models, the state vector length of 40 was used. All the parameters for both the NN and 
IMF models were tuned on the validation set. A single best model of each type was then applied to 
the final testing set. 

Table 1 lists the results of the NN approximation and the IMF approximation,$^7$ along with
results of different generative and discriminative parsing methods evaluated in the same experimental
setup (Bikel, 2004; Taskar et al., 2004; Turian et al., 2006; Charniak, 2000; Finkel et al., 2008).$^8$
The IMF model improves over the baseline NN approximation, with a relative error reduction in 
F-measure exceeding 8%. This improvement is statistically significant. 
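
The relative error reduction quoted here can be reproduced from the F-measures in Table 1, taking error as the distance to a ceiling of 100%. The helper `relative_error_reduction` and its `ceiling` parameter are illustrative assumptions about how the figure is computed:

```python
def relative_error_reduction(baseline, improved, ceiling=100.0):
    """Fraction of the baseline's remaining error that the improved model removes."""
    return (improved - baseline) / (ceiling - baseline)

# Parsing: NN method (89.1 F1) versus IMF method (90.0 F1).
parsing_gain = relative_error_reduction(89.1, 90.0)
```

Here the NN error of 10.9 points drops to 10.0 points, a reduction of 0.9 / 10.9, or roughly 8.3%, consistent with the stated figure of more than 8%. Passing a different ceiling, such as the accuracy of a sampler from the true model, gives the same measure relative to an upper bound other than 100%.
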

The IMF model achieves results which do not appear to be significantly different from the
results of the best model in the list (Charniak, 2000). Although no longer one of the most accurate
parsing models on the standard WSJ parsing benchmark (including sentences of all lengths), the
Charniak (2000) parser achieves competitive results (89.5% F-measure) and is still considered a
viable approach, so the results reported here confirm the viability of our models. It should also be
noted that previous results for the NN approximation to ISBNs on the standard WSJ benchmark
(Henderson, 2003, 2004) achieved accuracies which are still competitive with the state of the art
(89.1% F-measure for Henderson, 2003 and 90.1% F-measure for Henderson, 2004). For
comparison, the LPCFG model of Petrov et al. (2006) achieves 89.7% F-measure on the standard WSJ
benchmark.

7. Approximate training times on a standard desktop PC for the IMF and NN approximations were 140 and 3 hours, 
respectively, and parsing times were 3 and 0.05 seconds per token, respectively. Parsing with the IMF method could 
be made more efficient, for example by not requiring the numerical approximations to reach convergence. 

8. The results for the models of Bikel (2004) and Charniak (2000) trained and tested on sentences of length at most 15 
were originally reported by Turian and Melamed (2005). 

We do not report the results on our data set of the LPCFG model of Petrov et al. (2006), probably 
the most relevant previous work on grammar learning (see the extended discussion in Section 8), as 
it would require tuning of their split-merge EM algorithm to achieve optimal results on the smaller 
data set. However, we note that the CRF-based model of Finkel et al. (2008) (the reported 
‘feature-based’ version) and the LPCFG achieve very close results when trained and tested on the 
sentences of length under 100 (Finkel et al., 2008) and, therefore, would be expected to demonstrate 
similar results in our setting. Note also that the LPCFG decoding algorithm uses a form of Bayes risk 
minimisation to optimise for the specific scoring function, whereas our model, like most parsing 
methods in the literature, outputs the highest scoring tree (maximum a posteriori decoding). In fact, 
approximate Bayes risk minimisation can be used with our model and in our previous experiments 
resulted in an approximately 0.5% boost in performance (Titov and Henderson, 2006). We chose not 
to use it here, as maximum a posteriori decoding is simpler, more widely accepted and, unlike 
Bayes risk minimisation, is expected to result in self-consistent trees. 
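
The difference between the two decoding criteria can be illustrated on a list of candidate trees. The sketch below is not the decoder used in our experiments or in the LPCFG work: it assumes that trees are represented as sets of labelled brackets, that Bayes risk minimisation is approximated by choosing among whole candidates, and that the expected score is bracket F1. The candidate trees and their probabilities are invented for the example.

```python
from typing import FrozenSet, List, Tuple

Bracket = Tuple[str, int, int]   # (label, start, end)
Tree = FrozenSet[Bracket]
Candidate = Tuple[Tree, float]   # (tree, model probability)


def f1(predicted: Tree, reference: Tree) -> float:
    """Labelled bracket F1 between two trees."""
    matched = len(predicted & reference)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(reference)
    return 2 * precision * recall / (precision + recall)


def map_decode(candidates: List[Candidate]) -> Tree:
    """Maximum a posteriori decoding: return the single most probable tree."""
    return max(candidates, key=lambda c: c[1])[0]


def mbr_decode(candidates: List[Candidate]) -> Tree:
    """Approximate Bayes risk minimisation: maximise expected F1 over the candidates."""
    total = sum(p for _, p in candidates)

    def expected_f1(tree: Tree) -> float:
        return sum(p / total * f1(tree, other) for other, p in candidates)

    return max((tree for tree, _ in candidates), key=expected_f1)


# Invented candidates: `a` is the most probable, but `b` and `c` largely agree.
a = frozenset({("S", 0, 3), ("NP", 0, 2), ("VP", 2, 3)})
b = frozenset({("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("NP", 2, 3)})
c = frozenset({("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("PP", 2, 3)})
candidates = [(a, 0.40), (b, 0.35), (c, 0.25)]
```

Here `map_decode(candidates)` returns `a`, while `mbr_decode(candidates)` returns `b`, because the probability mass shared by the two similar trees outweighs the single most probable one.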


These experimental results suggest that ISBNs are an appropriate model for structured predic- 
tion. Even approximations such as those tested here, with a very strong factorisability assumption, 
allow us to build quite accurate parsing models. We believe this provides strong justification for 
work on more accurate approximations of ISBNs. 


8. Additional Related Work 


Whereas graphical models are standard models for sequence processing, there has not been much 
previous work on graphical models for the prediction of structures more complex than sequences. 
Sigmoid belief networks were used originally for character recognition tasks, but later a Markovian 
dynamic extension of this model was applied to the reinforcement learning task (Sallans, 2002). 
However, their graphical model, approximation method, and learning method differ substantially 
from those of this paper. 


When they were originally proposed, latent variable models for natural language parsing were 
not particularly successful, demonstrating results significantly below the state-of-the-art models (Kuri- 
hara and Sato, 2004; Matsuzaki et al., 2005; Savova and Peshkin, 2005; Riezler et al., 2002) or they 
were used in combination with already state-of-the-art models (Koo and Collins, 2005) and demon- 
strated a moderate improvement. More recently several methods (Petrov et al., 2006; Petrov and 
Klein, 2007; Liang et al., 2007), framed as grammar refinement approaches, demonstrated results 
similar to the best results achieved by generative models. All these approaches considered exten- 
sions of a classic PCFG model, which augment non-terminals of the grammar with latent variables 
(Latent-annotated PCFGs, LPCFGs). Even though marginalisation can be performed efficiently 
by using dynamic programming, decoding under this model is NP-hard (Matsuzaki et al., 2005; 
Sima’an, 1992). Instead, approximate parsing algorithms were considered. 


The main reason for the improved performance of the more recent LPCFG methods is that 
they address the problem that with LPCFGs it is difficult to discover the appropriate latent variable 
augmentations for non-terminals. Early LPCFG models which used straight-forward implemen- 
tations of expectation maximisation algorithms did not achieve state-of-the-art results (Matsuzaki 
et al., 2005; Prescher, 2005). To solve this problem the split-and-merge approach was considered 
in Petrov et al. (2006); Petrov and Klein (2007) and Dirichlet Process priors in Liang et al. (2007). 
The model of Petrov and Klein (2007) achieved the best reported result for a single model parser 
(90.1% F-measure). Even with the more sophisticated learning methods, in all of the work on 
LPCFGs the number of latent annotations which are successfully learnt is small, compared to the 
40-dimensional vectors used in our experiments with ISBNs. 


One important difference between LPCFGs and ISBNs is that in LPCFGs the latent annotations 
are used to expand the set of atomic labels used in a PCFG, whereas ISBNs directly reason with a 
vector of latent features. This use of a vector space instead of atomic labels provides ISBNs with a 
much larger label space with a much richer structure of similarity between labels, based on shared 
features. This highly structured label space allows standard gradient descent techniques to work 
well even with large numbers of latent features. In contrast, learning for LPCFGs has required the 
specialised methods discussed above and has succeeded in searching a much more limited space 
of latent annotations. These specialised methods impose a hierarchical structure of similarity on 
the atomic labels of LPCFGs, based on recursive binary augmentations of labels (“splits”), but this 
hierarchical structure is much less rich than the similarity structure of a vector space. 


Another important difference with LPCFGs is that ISBN models do not place strong restrictions 
on the structure of statistical dependencies between latent variables, such as the context-free restric- 
tion of LPCFGs. This makes ISBNs easily applicable to a much wider set of problems. For example, 
ISBNs have been applied to the dependency parsing problem (Titov and Henderson, 2007) and to 
joint dependency parsing and semantic role labelling (Henderson et al., 2008; Gesmundo et al., 
2009), where in both cases they achieved state-of-the-art results. The application of LPCFG models 
even to dependency parsing has required sophisticated grammar transformations (Musillo and 
Merlo, 2008), to which the split-and-merge training approach has not yet been successfully adapted. 


The experiments reported in Henderson et al. (2008) also suggest that the latent annotations 
of syntactic states are not only useful for syntactic parsing itself but also can be helpful for other 
tasks. In these experiments, semantic role labelling performance rose by about 3.5% when latent 
annotations for syntactic decisions were provided, thereby indicating that the latent annotation of 
syntactic parsing states helps semantic role labelling. 


9. Conclusions 


This paper proposes a new class of graphical models for structured prediction problems, incremental 
sigmoid belief networks, and applies it to natural language grammar learning. ISBNs allow 
the structure of the graphical model to be dependent on the output structure. This allows the model 
to directly express regularities that are local in the output structure but not local in the input struc- 
ture, making ISBNs appropriate for parsing problems. This ability supports the induction of latent 
variables which augment the grammatical structures annotated in the training data, thereby solving 
a limited form of grammar induction. Exact inference with ISBNs is not tractable, but we derive 
two tractable approximations. First, it is shown that the feed-forward neural network of Henderson 
(2003) can be considered as a simple approximation to ISBNs. Second, a more accurate but still 
tractable approximation based on mean field theory is proposed. 


Both approximation models are empirically evaluated. First, artificial experiments were performed, 
where both approximations significantly outperformed a baseline. The mean field method 
achieved an average relative error reduction of about 27% over the neural network approximation, 
demonstrating that it is a more accurate approximation. Second, both approximations were applied 
to the natural language parsing task, where the mean field method demonstrated significantly better 
results. These results are not significantly different from the results of another history-based 
probabilistic model of parsing (Charniak, 2000) which is competitive with the state-of-the-art for 
single-model parsers. The fact that a more accurate approximation leads to a more accurate parser 
suggests that the ISBNs proposed here are a good abstract model for grammar learning. This empirical 
result motivates further research into more accurate approximations of ISBNs. 


Appendix A. 


This appendix presents details of computing gradients for the incremental mean field approxima- 
tion. We perform maximum likelihood estimation of the ISBN parameters, using the estimator of 
the structure probability defined in expression (8). We focus on the incremental mean field approxi- 
mation introduced in Section 5.3. As we have shown there, estimates of the conditional distribution 
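$q_k^t(d) \approx P(d_k^t = d \mid h(t,k))$ are dependent on the means $\mu^{t,k}$ computed at the elementary step $(t,k)$ in 
the same way as the estimates $q_k^t(d)$ in the feed-forward approximation depend on the means $\mu^t$ in 
expression (10), that is, 

$$q_k^t(d) = \frac{\Phi_{h(t,k)}(d) \exp\left(\sum_j W_{dj}\,\mu_j^{t,k}\right)}{\sum_{d'} \Phi_{h(t,k)}(d') \exp\left(\sum_j W_{d'j}\,\mu_j^{t,k}\right)}. \qquad (15)$$

We use the gradient descent algorithm, so the goal of this section is to describe how to compute 
derivatives of the log-likelihood 

$$\hat{L}(T) = \sum_{t,k} \left( \sum_j W_{d_k^t j}\,\mu_j^{t,k} - \log \sum_{d'} \Phi_{h(t,k)}(d') \exp\left(\sum_j W_{d'j}\,\mu_j^{t,k}\right) \right)$$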
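
As a concrete reading of expression (15) and of the log-likelihood above, the following sketch computes the decision distribution and the summed log-probability for a list of elementary steps. It assumes that $\Phi_{h(t,k)}$ acts as a 0/1 mask over allowed decisions; the function names and the representation of a step as a `(mu, allowed, d)` triple are choices made for the illustration.

```python
import math
from typing import List, Sequence, Tuple

Step = Tuple[Sequence[float], Sequence[bool], int]  # (means, allowed mask, chosen decision)


def decision_distribution(
    W: Sequence[Sequence[float]], mu: Sequence[float], allowed: Sequence[bool]
) -> List[float]:
    """Expression (15): q(d) proportional to Phi(d) * exp(sum_j W[d][j] * mu[j])."""
    scores = [sum(w * m for w, m in zip(row, mu)) for row in W]
    top = max(s for s, ok in zip(scores, allowed) if ok)  # subtracted for stability
    weights = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(weights)
    return [w / total for w in weights]


def log_likelihood(W: Sequence[Sequence[float]], steps: Sequence[Step]) -> float:
    """Sum over elementary steps of log q(d) for the decision d actually taken."""
    return sum(
        math.log(decision_distribution(W, mu, allowed)[d]) for mu, allowed, d in steps
    )
```

Each term of `log_likelihood` equals the bracketed summand of the log-likelihood above, since the logarithm of (15) for an allowed decision is the score of that decision minus the log of the normaliser.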


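with respect to all the model parameters. The derivatives of $\hat{L}(T)$ with respect to model parameters 
can be expressed as 

$$\frac{d\hat{L}(T)}{dx} = \sum_{d,j} \frac{\partial \hat{L}(T)}{\partial W_{dj}} \frac{dW_{dj}}{dx} + \sum_{t,k} \sum_i \frac{\partial \hat{L}(T)}{\partial \mu_i^{t,k}} \frac{d\mu_i^{t,k}}{dx}, \qquad (16)$$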


where $x$ is any model parameter, that is, any entry of the weight matrices $J$, $B$ and $W$. All the terms 
except for $d\mu_i^{t,k}/dx$ are trivial to compute: 

$$\frac{\partial \hat{L}(T)}{\partial W_{dj}} = \sum_{t,k} \mu_j^{t,k} \left( \delta_{d d_k^t} - q_k^t(d) \right), \qquad \frac{\partial \hat{L}(T)}{\partial \mu_i^{t,k}} = W_{d_k^t i} - \sum_d q_k^t(d)\, W_{di}, \qquad (17)$$

where $\delta_{ij}$ is the Kronecker delta. Computation of the total derivatives $d\mu_i^{t,k}/dx$ is less straightforward. 
The main challenge is that the dependence of $\mu_i^{t,k}$ for $k > 1$ on other model parameters cannot be 
expressed analytically, as we found values of $\mu^{t,k}$ by performing numerical maximisation of the 
expression $L_V^{t,k}$ (13). In the next several paragraphs we will consider only the case of $k > 1$, but later 
we will return to the simpler case of $k = 1$, where the computation of derivatives is equivalent to the 
backpropagation algorithm in standard feed-forward neural networks. 
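
The directly computable partial derivatives of the log-likelihood can be checked numerically. The sketch below treats a single elementary step, derives the gradients of $\log q(d)$ with respect to the means and the output weights from the softmax form (15), and is intended to be compared with central finite differences. The function names and toy values are assumptions made for the illustration, and all decisions are taken to be allowed.

```python
import math
from typing import List, Sequence


def softmax_q(W: Sequence[Sequence[float]], mu: Sequence[float]) -> List[float]:
    """Decision distribution of the form (15) with every decision allowed."""
    scores = [sum(w * m for w, m in zip(row, mu)) for row in W]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def step_log_prob(W, mu, d: int) -> float:
    """Contribution of one elementary step to the log-likelihood."""
    return math.log(softmax_q(W, mu)[d])


def grad_mu(W, mu, d: int) -> List[float]:
    """d log q(d) / d mu_i = W[d][i] - sum_e q(e) W[e][i]."""
    q = softmax_q(W, mu)
    return [
        W[d][i] - sum(q[e] * W[e][i] for e in range(len(W))) for i in range(len(mu))
    ]


def grad_W(W, mu, d: int) -> List[List[float]]:
    """d log q(d) / d W[e][j] = mu[j] * (delta(e, d) - q(e))."""
    q = softmax_q(W, mu)
    return [
        [mu[j] * ((1.0 if e == d else 0.0) - q[e]) for j in range(len(mu))]
        for e in range(len(W))
    ]
```

Summing `grad_W` over all elementary steps gives the derivative with respect to an output weight, while `grad_mu` is the per-step derivative with respect to the means of that step.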

Note that the gradient of the log-likelihood can be easily computed in the standard mean field 
methods for SBNs (Saul and Jordan, 1999; Saul et al., 1996), even though they also use numeric 
strategies to find optimal means. There, the means are selected so as to maximise the variational lower 
bound $L_V$ (4), which is used as the log-likelihood $\hat{L} = L_V$ in their approach. In static SBNs it 
is feasible to perform complete maximisation of the entire $L_V$, which involves multiple backward-forward 
passes through the structure of the graphical model. This leads to all the derivatives $\partial \hat{L} / \partial \mu_i$ 
being equal to zero. Therefore, no error backpropagation is needed in their case. All the derivatives 
$d\hat{L}/dx$ can be computed using variational parameters associated with the nodes corresponding to the 
parameter $x$. For example, if $x$ is a weight of an edge, then only the variational parameters associated with 
the variables at its ends are needed to compute the derivative. Unfortunately, learning with the 
incremental mean field approximation proposed in this paper is somewhat more complex. 


In order to compute the derivatives $d\mu_i^{t,k}/dx$ we assume that maximisation of $L_V^{t,k}$ is done until 
convergence, so that the partial derivatives of $L_V^{t,k}$ with respect to $\mu_i^{t,k}$ are equal to zero. This gives us a 
system of equations, which describes interdependencies between the current means $\mu^{t,k}$, the previous 
means $\mu^{t,k-1}$ and the weights $W$: 

$$F_i^{t,k} = \ln(1 - \mu_i^{t,k}) - \ln \mu_i^{t,k} - \ln(1 - \mu_i^{t,k-1}) + \ln \mu_i^{t,k-1} + W_{d_{k-1}^t i} - \sum_d \tilde{q}_{k-1}^t(d)\, W_{di} = 0$$

for $1 \le i \le n$, where $\tilde{q}_{k-1}^t$ is the distribution over decisions computed in the same way as $q_{k-1}^t$ (15), 
but using means $\mu_i^{t,k}$ instead of $\mu_i^{t,k-1}$: 

$$\tilde{q}_{k-1}^t(d) = \frac{\Phi_{h(t,k-1)}(d) \exp\left(\sum_j W_{dj}\,\mu_j^{t,k}\right)}{\sum_{d'} \Phi_{h(t,k-1)}(d') \exp\left(\sum_j W_{d'j}\,\mu_j^{t,k}\right)}.$$
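This system of equations permits the use of implicit differentiation to compute the derivatives 
$\partial \mu_i^{t,k} / \partial z$, where $z$ can be a weight matrix component $W_{dj}$ or a previous mean $\mu_j^{t,k-1}$ involved in 
expression (13). It is important to distinguish $z$ from $x$, used above, because $x$ can be an arbitrary 
model parameter not necessarily involved in the expression $L_V^{t,k}$ but affecting the current means $\mu_i^{t,k}$ 
through $\mu_j^{t,k-1}$. It is equally important to distinguish the partial derivatives $\partial \mu_i^{t,k} / \partial z$ from the total 
derivatives $d\mu_i^{t,k}/dz$, because the dependency of $\mu_i^{t,k}$ on the parameter $z$ can be both direct, through 
maximisation of $L_V^{t,k}$, and indirect, through previous maximisation steps $(t',k')$ where $L_V^{t',k'}$ was 
dependent on $z$. The relation between the total and partial derivatives can be expressed as 

$$\frac{d\mu_i^{t,k}}{dz} = \frac{\partial \mu_i^{t,k}}{\partial z} + \sum_j \frac{\partial \mu_i^{t,k}}{\partial \mu_j^{t,k-1}} \frac{d\mu_j^{t,k-1}}{dz},$$

meaning that indirect dependencies of $\mu_i^{t,k}$ ($k > 1$) on parameters $z$ come through the previous 
means $\mu_j^{t,k-1}$. We apply the implicit differentiation theorem and obtain the vector of partial derivatives 
with respect to a parameter $z$, $D_z \mu^{t,k} = \{\partial \mu_1^{t,k} / \partial z, \ldots, \partial \mu_n^{t,k} / \partial z\}$, as 

$$D_z \mu^{t,k} = -\left(D_{\mu^{t,k}} F^{t,k}\right)^{-1} D_z F^{t,k}, \qquad (18)$$

where $D_{\mu^{t,k}} F^{t,k}$ and $D_z F^{t,k}$ are Jacobians: 

$$D_{\mu^{t,k}} F^{t,k} = \left( \frac{\partial F_i^{t,k}}{\partial \mu_j^{t,k}} \right)_{i,j=1}^{n}, \qquad D_z F^{t,k} = \left( \frac{\partial F_1^{t,k}}{\partial z}, \ldots, \frac{\partial F_n^{t,k}}{\partial z} \right)^{\top}.$$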
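
The use of implicit differentiation in (18) can be illustrated on a small example. The sketch below assumes a stationarity condition of the form $F_i(\mu) = \ln(1 - \mu_i) - \ln \mu_i + c_i - \sum_d \tilde{q}(d) W_{di} = 0$, where $c_i$ stands for all terms that do not depend on the current means and $\tilde{q}$ is a softmax over $\sum_j W_{dj}\mu_j$ with every decision allowed. The parameter $z$ is taken to be one component $c_j$, so that $D_z F$ is a unit vector. The fixed-point solver, the function names and the toy weights are assumptions of the illustration, not the optimiser used in our experiments.

```python
import math
from typing import List, Sequence


def q_tilde(W: Sequence[Sequence[float]], mu: Sequence[float]) -> List[float]:
    scores = [sum(w * m for w, m in zip(row, mu)) for row in W]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def mean_W(W, mu) -> List[float]:
    """Expected output weights sum_d q(d) W[d][i] under q_tilde."""
    q = q_tilde(W, mu)
    return [sum(q[d] * W[d][i] for d in range(len(W))) for i in range(len(mu))]


def solve_means(W, c, iterations: int = 200) -> List[float]:
    """Solve F(mu) = 0 by iterating mu_i = sigmoid(c_i - E_q[W_di]) (small weights assumed)."""
    mu = [0.5] * len(c)
    for _ in range(iterations):
        m = mean_W(W, mu)
        mu = [1.0 / (1.0 + math.exp(-(ci - mi))) for ci, mi in zip(c, m)]
    return mu


def jacobian_mu(W, mu) -> List[List[float]]:
    """D_mu F: negated (diagonal term + covariance of the output weights)."""
    q, m, n = q_tilde(W, mu), mean_W(W, mu), len(mu)
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            second = sum(q[d] * W[d][i] * W[d][j] for d in range(len(W)))
            J[i][j] = -(second - m[i] * m[j])
        J[i][i] -= 1.0 / (mu[i] * (1.0 - mu[i]))
    return J


def solve_linear(A: List[List[float]], b: List[float]) -> List[float]:
    """Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


def implicit_derivative(W, c, j: int) -> List[float]:
    """d mu / d c_j = -(D_mu F)^{-1} D_z F, with D_z F the j-th unit vector."""
    mu = solve_means(W, c)
    rhs = [-1.0 if i == j else 0.0 for i in range(len(c))]
    return solve_linear(jacobian_mu(W, mu), rhs)
```

Re-solving the fixed point for slightly perturbed values of $c_j$ and taking a central difference should agree with `implicit_derivative`, which is the property that makes the numerical maximisation differentiable without unrolling its iterations.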
Now we derive the Jacobians $D_{\mu^{t,k}} F^{t,k}$ and $D_z F^{t,k}$ for different types of parameters $z$. The matrix 
$D_{\mu^{t,k}} F^{t,k}$ consists of the components 

$$\frac{\partial F_i^{t,k}}{\partial \mu_j^{t,k}} = -\frac{\delta_{ij}}{\mu_i^{t,k}(1 - \mu_i^{t,k})} - \sum_d \tilde{q}_{k-1}^t(d)\, W_{di} W_{dj} + \left( \sum_d \tilde{q}_{k-1}^t(d)\, W_{di} \right) \left( \sum_d \tilde{q}_{k-1}^t(d)\, W_{dj} \right), \qquad (19)$$

where $\delta_{ij}$ is the Kronecker delta. If we consider $W_{\cdot j}$ as a random variable accepting values $W_{dj}$ under 
the distribution $\tilde{q}_{k-1}^t$, we can rewrite the Jacobian $D_{\mu^{t,k}} F^{t,k}$ as the negated sum of a positive diagonal 
matrix and the covariance matrix $\Sigma_{\tilde{q}_{k-1}^t}(W)$. Therefore the matrix $D_{\mu^{t,k}} F^{t,k}$ is negative semidefinite. 


Note that this matrix is the Hessian for the expression $L_V^{t,k}$ (13), which implies the concavity of $L_V^{t,k}$ 
stated previously without proof. Similarly, the Hessian for (11) differs only by including output 
weight covariances for all the previous elementary decisions, not only for the last one, and therefore 
expression (11) is also concave. 
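
The definiteness argument can be checked numerically. The sketch below evaluates the quadratic form $x^\top (D_\mu F)\, x$ for a Jacobian with the structure of (19), using the fact that the covariance part contributes the variance of $\sum_i x_i W_{di}$ under the decision distribution. The function name, the random toy values and the assumption that every decision is allowed are choices made for the illustration.

```python
import math
import random
from typing import Sequence


def quadratic_form(
    W: Sequence[Sequence[float]], mu: Sequence[float], x: Sequence[float]
) -> float:
    """x^T (D_mu F) x, where D_mu F = -(positive diagonal + covariance), as in (19)."""
    scores = [sum(w * m for w, m in zip(row, mu)) for row in W]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    q = [w / total for w in weights]

    # Diagonal part: sum_i x_i^2 / (mu_i (1 - mu_i)), positive for 0 < mu_i < 1.
    diagonal = sum(xi * xi / (m * (1.0 - m)) for xi, m in zip(x, mu))

    # Covariance part: variance under q of the projection sum_i x_i W[d][i].
    projections = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    mean = sum(qd * p for qd, p in zip(q, projections))
    variance = sum(qd * (p - mean) ** 2 for qd, p in zip(q, projections))

    return -(diagonal + variance)


rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
mu = [rng.uniform(0.05, 0.95) for _ in range(4)]
values = [quadratic_form(W, mu, [rng.uniform(-1, 1) for _ in range(4)]) for _ in range(100)]
```

Every entry of `values` is negative, because the diagonal term is strictly positive for any non-zero `x` and the variance term is non-negative.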








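To conclude with the computation of $\partial \mu_i^{t,k} / \partial z$, we compute $D_{\mu^{t,k-1}} F^{t,k}$ and $D_W F^{t,k}$: 

$$\frac{\partial F_i^{t,k}}{\partial \mu_j^{t,k-1}} = \frac{\delta_{ij}}{\mu_j^{t,k-1}(1 - \mu_j^{t,k-1})}, \qquad (20)$$

$$\frac{\partial F_i^{t,k}}{\partial W_{dj}} = \delta_{ij}\,\delta_{d d_{k-1}^t} - \tilde{q}_{k-1}^t(d) \left( \delta_{ij} + \left( W_{di} - \sum_{d'} \tilde{q}_{k-1}^t(d')\, W_{d'i} \right) \mu_j^{t,k} \right). \qquad (21)$$

Now the partial derivatives $\partial \mu_i^{t,k} / \partial \mu_j^{t,k-1}$ and $\partial \mu_i^{t,k} / \partial W_{dj}$ can be computed by substituting 
expressions (19)-(21) into (18). 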
For $k = 1$, $\mu_i^{t,1}$ was shown to be equal to the sigmoid function of the weighted sum of the parents' 
means as defined in (12) and (14). Therefore, we can compute the partial derivatives of $\mu_i^{t,1}$ with 
respect to other means and parameters involved in (12) and (14): 


In order to compute the total derivatives in (16), derivatives with respect to previous means 
$\partial \mu_i^{t,k} / \partial \mu_j^{t,k-1}$ are used to propagate the error in a similar way to the neural network 
backpropagation algorithm (Rumelhart et al., 1986). We denote the total derivative of the approximate 
log-likelihood with respect to the means of the latent variables as $\epsilon_i^{t,k} = d\hat{L}(T) / d\mu_i^{t,k}$. The 
incrementality of the mean field algorithm guarantees that latent vectors of means $\mu^{t,k}$ are computed 
from the means of the previous elementary steps. Therefore, values $\epsilon_i^{t,k}$ can be computed in the 
opposite order, propagating the information back through the structure. Namely, the recursive 
formulae would be: 
After computing values $\epsilon_i^{t,k}$ for all the elementary steps $(t,k)$, we can evaluate the derivatives of 
the model parameters. We start with the output distribution parameters $W_{di}$: 

$$\frac{d\hat{L}(T)}{dW_{di}} = \frac{\partial \hat{L}(T)}{\partial W_{di}} + \sum_{t,k} \sum_j \epsilon_j^{t,k} \frac{\partial \mu_j^{t,k}}{\partial W_{di}}.$$

The first term here is evaluated as defined in (17), and the term $\partial \mu_j^{t,k} / \partial W_{di}$ is computed as explained above. 


Finally, the total derivatives of the log-likelihood with respect to the entries of the weight matrices $J$ and $B$ are 
found as follows 